Goto

Collaborating Authors

 markovian reward





Interpretable Reward Redistribution in Reinforcement Learning: A Causal Approach

Neural Information Processing Systems

A major challenge in reinforcement learning is to determine which state-action pairs are responsible for future rewards that are delayed. Reward redistribution serves as a solution to re-assign credits for each time step from observed sequences. While the majority of current approaches construct the reward redistribution in an uninterpretable manner, we propose to explicitly model the contributions of state and action from a causal perspective, resulting in an interpretable reward redistribution and preserving policy invariance. In this paper, we start by studying the role of causal generative models in reward redistribution by characterizing the generation of Markovian rewards and trajectory-wise long-term return and further propose a framework, called Generative Return Decomposition (GRD), for policy optimization in delayed reward scenarios. Specifically, GRD first identifies the unobservable Markovian rewards and causal relations in the generative process. Then, GRD makes use of the identified causal generative model to form a compact representation to train policy over the most favorable subspace of the state space of the agent. Theoretically, we show that the unobservable Markovian reward function is identifiable, as well as the underlying causal structure and causal models.


Learning Task Specifications from Demonstrations

Marcell Vazquez-Chanlatte, Susmit Jha, Ashish Tiwari, Mark K. Ho, Sanjit Seshia

Neural Information Processing Systems

In many settings (e.g., robotics) demonstrations provide a natural way to specify a task. For example, an agent (e.g., human expert) gives one or more demonstrations of the task from which we seek to automatically synthesize a policy for the robot to execute. Typically, one models the demonstrator as episodically operating within a dynamical system whose transition relation only depends on the current state and action (called the Markov condition). However, even if the dynamics are Markovian, many problems are naturally modeled in non-Markovian terms (see Ex 1).




Interpretable Reward Redistribution in Reinforcement Learning: A Causal Approach

Neural Information Processing Systems

A major challenge in reinforcement learning is to determine which state-action pairs are responsible for future rewards that are delayed. Reward redistribution serves as a solution to re-assign credits for each time step from observed sequences. While the majority of current approaches construct the reward redistribution in an uninterpretable manner, we propose to explicitly model the contributions of state and action from a causal perspective, resulting in an interpretable reward redistribution and preserving policy invariance. In this paper, we start by studying the role of causal generative models in reward redistribution by characterizing the generation of Markovian rewards and trajectory-wise long-term return and further propose a framework, called Generative Return Decomposition (GRD), for policy optimization in delayed reward scenarios. Specifically, GRD first identifies the unobservable Markovian rewards and causal relations in the generative process.


Interpretable Reward Redistribution in Reinforcement Learning: A Causal Approach

Zhang, Yudi, Du, Yali, Huang, Biwei, Wang, Ziyan, Wang, Jun, Fang, Meng, Pechenizkiy, Mykola

arXiv.org Artificial Intelligence

A major challenge in reinforcement learning is to determine which state-action pairs are responsible for future rewards that are delayed. Reward redistribution serves as a solution to re-assign credits for each time step from observed sequences. While the majority of current approaches construct the reward redistribution in an uninterpretable manner, we propose to explicitly model the contributions of state and action from a causal perspective, resulting in an interpretable reward redistribution and preserving policy invariance. In this paper, we start by studying the role of causal generative models in reward redistribution by characterizing the generation of Markovian rewards and trajectory-wise long-term return and further propose a framework, called Generative Return Decomposition (GRD), for policy optimization in delayed reward scenarios. Specifically, GRD first identifies the unobservable Markovian rewards and causal relations in the generative process. Then, GRD makes use of the identified causal generative model to form a compact representation to train policy over the most favorable subspace of the state space of the agent. Theoretically, we show that the unobservable Markovian reward function is identifiable, as well as the underlying causal structure and causal models. Experimental results show that our method outperforms state-of-the-art methods and the provided visualization further demonstrates the interpretability of our method. The project page is located at https://reedzyd.github.io/GenerativeReturnDecomposition/.


Hellinger KL-UCB based Bandit Algorithms for Markovian and i.i.d. Settings

Roy, Arghyadip, Shakkottai, Sanjay, Srikant, R.

arXiv.org Machine Learning

In the regret-based formulation of multi-armed bandit (MAB) problems, except in rare instances, much of the literature focuses on arms with i.i.d. rewards. In this paper, we consider the problem of obtaining regret guarantees for MAB problems in which the rewards of each arm form a Markov chain which may not belong to a single parameter exponential family. To achieve logarithmic regret in such problems is not difficult: a variation of standard KL-UCB does the job. However, the constants obtained from such an analysis are poor for the following reason: i.i.d. rewards are a special case of Markov rewards and it is difficult to design an algorithm that works well independent of whether the underlying model is truly Markovian or i.i.d. To overcome this issue, we introduce a novel algorithm that identifies whether the rewards from each arm are truly Markovian or i.i.d. using a Hellinger distance-based test. Our algorithm then switches from using a standard KL-UCB to a specialized version of KL-UCB when it determines that the arm reward is Markovian, thus resulting in low regret for both i.i.d. and Markovian settings.